Natural Language Information Retrieval: TIPSTER-2 Final Report
نویسنده
چکیده
We report on the joint GE/NYU natural language information retrieval project as related to the Tipster Phase 2 research conducted initially at NYU and subsequently at GE R&D Center and NYU. The evaluation results discussed here were obtained in connection with the 3rd and 4th Text Retrieval Conferences (TREC-3 and TREC-4). The main thrust of this project is to use natural language processing techniques to enhance the effectiveness of full-text document retrieval. During the course of the four TREC conferences, we have built a prototype IR system designed around a statistical full-text indexing and search backbone provided by the NIST's Prise engine. The original Prise has been modified to allow handling of multi-word phrases, differential term weighting schemes, automatic query expansion, index partitioning and rank merging, as well as dealing with complex documents. Natural language processing is used to preprocess the documents in order to extract content-carrying terms, discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and process user's natural language requests into effective search queries. The overall architecture of the system is essentially the same for both years, as our efforts were directed at optimizing the performance of all components. A notable exception is the new massive query expansion module used in routing experiments, which replaces a prototype extension used in the TREC-3 system. On the other hand, it has to be noted that the character and the level of difficulty of TREC queries has changed quite significantly since the last year evaluation. TREC-4 new ad-hoc queries are far shorter, less focused, and they have a flavor of information requests ("What is the prognosis of ...") rather than search directives typical for earlier TRECs ("The relevant document will contain ..."). This makes building of good search queries a more sensitive task than before. We thus decided to introduce only minimum number of changes to our indexing and search processes, and even roll back some of the TREC-3 extensions which dealt with longer and somewhat redundant queries. Overall, our system performed quite well as our position with respect to the best systems improved steadily since the beginning of TREC. We participated in both main evaluation categories: category A ad-hoc and routing, working with approx. 3.3 GBytes of text. We submitted 4 official runs in automatic adhoc, manual ad-hoc, and automatic routing (2), and were ranked 6 or 7 in each category (out of 38 participating teams). It should be noted that the most significant gain in performance seems to have occurred in precision near the top of the ranking, at 5, 10, 15 and 20 documents. Indeed, our unofficial manual runs performed after TREC-4 conference show superior results in these categories, topping by a large margin the best manual scores by any system in the official evaluation. In general, we can note substantial improvement in performance when phrasal terms are used, especially in ad-hoc runs. Looking back at TREC-2 and TREC-3 one may observe that these improvements appear to be tied to the length and specificity of the query: the longer the query, the more improvement from linguistic processes. This can be seen comparing the improvement over baseline for automatic adhoc runs (very short queries), for manual runs (longer queries), and for semi-interactive runs (yet longer queries). In addition, our TREC-3 results (with long and detailed queries) showed 20-25% improvement in precision attributed to NLP, as compared to 10-16% in TREC-4.
منابع مشابه
Enhancing Detection through Linguistic Indexing and Topic Expansion
Natural language processing techniques may hold a tremendous potential for overcoming the inadequacies of purely quantitative methods of text information retrieval. Under the Tipster contracts in phases I through III, GE group has set out to explore this potential through development and evaluation of new text processing techniques. This work resulted in some significant advances and in a bette...
متن کاملRobust Text Processing in Automated Information Retrieval
We report on the results of a series of experiments with a prototype text retrieval system which uses relatively advanced natural language processing techniques in order to enhance the effectiveness of statistical document retrieval. In this paper we show that large-scale natural language processing (hundreds of millions of words and more) is not only required for a better retrieval, but it is ...
متن کاملTIPSTER Phase I Final Report
Overview: During Phase I of the TIPSTER program, HNC developed a unique approach to machine learning of similarity of meaning. This approach, embodied in a system called "MatchPlus", exploits this learned similarity of meaning for concept-based text retrieval, routing and visualization of textual information. MatchPlus uses an information representation scheme called "context vectors" to encode...
متن کاملNatural Language Information Retrieval: TREC-3 Report
In this paper we report on the recent developments in NYU’s natural language information retrieval system, especially as related to the 3rd Text Retrieval Conference (TREC-3). The main characteristic of this system is the use of advanced natural language processing to enhance the effectiveness of term-based document retrieval. The system is designed around a traditional statistical backbone con...
متن کاملThe Cornell TIPSTER Phase III Project
The overall objective of the Cornell University TIPSTER Project was to improve end-user efficiency in information retrieval systems by reducing the amount of text that the user must process [1]. The project focuses on high precision IR, near-duplicate detection and context-dependent summarization. The two main foundations of the research are the latest version of the Smart system for informatio...
متن کامل